Open In Colab

In [1]:
!pip install pandas-profiling==2.7.1
Requirement already satisfied: pandas-profiling==2.7.1 in /usr/local/lib/python3.6/dist-packages (2.7.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (1.18.5)
Requirement already satisfied: confuse>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (1.3.0)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.0.6)
Requirement already satisfied: scipy>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (1.4.1)
Requirement already satisfied: missingno>=0.4.2 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.4.2)
Requirement already satisfied: jinja2>=2.11.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (2.11.2)
Requirement already satisfied: visions[type_image_path]==0.4.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.4.1)
Requirement already satisfied: phik>=0.9.10 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.10.0)
Requirement already satisfied: ipywidgets>=7.5.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (7.5.1)
Requirement already satisfied: matplotlib>=3.2.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (3.2.2)
Requirement already satisfied: htmlmin>=0.1.12 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.1.12)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (0.16.0)
Requirement already satisfied: requests>=2.23.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (2.23.0)
Requirement already satisfied: tqdm>=4.43.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (4.49.0)
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (1.0.5)
Requirement already satisfied: astropy>=4.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling==2.7.1) (4.0.1.post1)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.6/dist-packages (from confuse>=1.0.0->pandas-profiling==2.7.1) (3.13)
Requirement already satisfied: seaborn in /usr/local/lib/python3.6/dist-packages (from missingno>=0.4.2->pandas-profiling==2.7.1) (0.10.1)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.11.1->pandas-profiling==2.7.1) (1.1.1)
Requirement already satisfied: attrs>=19.3.0 in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (20.2.0)
Requirement already satisfied: networkx>=2.4 in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (2.5)
Requirement already satisfied: imagehash; extra == "type_image_path" in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (4.1.0)
Requirement already satisfied: Pillow; extra == "type_image_path" in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (7.0.0)
Requirement already satisfied: numba>=0.38.1 in /usr/local/lib/python3.6/dist-packages (from phik>=0.9.10->pandas-profiling==2.7.1) (0.48.0)
Requirement already satisfied: traitlets>=4.3.1 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling==2.7.1) (4.3.3)
Requirement already satisfied: ipykernel>=4.5.1 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling==2.7.1) (4.10.1)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.5.0)
Requirement already satisfied: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling==2.7.1) (3.5.1)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.0.7)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling==2.7.1) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling==2.7.1) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling==2.7.1) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling==2.7.1) (1.2.0)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.23.0->pandas-profiling==2.7.1) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.23.0->pandas-profiling==2.7.1) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.23.0->pandas-profiling==2.7.1) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.23.0->pandas-profiling==2.7.1) (2020.6.20)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3->pandas-profiling==2.7.1) (2018.9)
Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from networkx>=2.4->visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (4.4.2)
Requirement already satisfied: PyWavelets in /usr/local/lib/python3.6/dist-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (1.1.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.4.1->pandas-profiling==2.7.1) (1.15.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling==2.7.1) (50.3.0)
Requirement already satisfied: llvmlite<0.32.0,>=0.31.0dev0 in /usr/local/lib/python3.6/dist-packages (from numba>=0.38.1->phik>=0.9.10->pandas-profiling==2.7.1) (0.31.0)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.2.0)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.6/dist-packages (from ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.3.5)
Requirement already satisfied: tornado>=4.0 in /usr/local/lib/python3.6/dist-packages (from ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.1.1)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (4.8.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.7.5)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (2.6.1)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (1.0.18)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.8.1)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.6/dist-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.3.1)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (2.6.0)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (4.6.3)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.6/dist-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (19.0.2)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.6.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.2.5)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (5.6.1)
Requirement already satisfied: Send2Trash in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (1.5.0)
Requirement already satisfied: terminado>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.8.3)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.8.4)
Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (3.1.5)
Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.4.4)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.6.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (1.4.2)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.3)
Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (0.5.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling==2.7.1) (20.4)

Librerías a usar

In [2]:
import pandas as pd
import numpy as np
import scipy.stats  as stats
from scipy.stats import chi2
from scipy.stats import chi2_contingency

import requests
import zipfile

import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("ggplot")

import pandas_profiling
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

Descarga de datos

In [3]:
def download_content(url):
    r = requests.get(url)
    with open('bank-additional.zip', "wb") as f:
        f.write(r.content)
    return
download_content('https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip')

with zipfile.ZipFile('bank-additional.zip', 'r') as zip_ref:
    zip_ref.extractall('zip')

Lectura de datos

In [4]:
df = pd.read_csv('zip/bank-additional/bank-additional-full.csv',delimiter=";")
df.head()
Out[4]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

Generación de reporte

In [5]:
pandas_profiling.ProfileReport(df)



Out[5]:

Entendimiento del negocio

Los datos están relacionados con campañas de marketing de una entidad financiera portuguesa. Las campañas de marketing se basaron en llamadas a teléfono y celular. El dataframe con tiene 41188 observaciones y 21 variables, de estas 10 son categóricas, 10 numéricas y 1 tipo boolean o binaria. Las variables contienen información demográfica del cliente, información relacionada con el último contacto de la última campaña, información del contexto social y económico e información del cliente con respecto a otras campañas.

En este link se puede encontrar información más detallada respecto a las variables

Proceso de limpieza

  • Los tipos de las variables fueron identificados correctamente por pandas al cargar el archivo
  • Hay 12 filas duplicadas, dado el contexto el problema no se deben eliminar pues pueden corresponder a llamadas realizadas al mismo cliente.
  • Todos los datos están completos
  • La mayoria de los datos representan la realidad de acuerdo a la información de las variables en el link. Sin embargo, hay variables que tienen unknown en sus valores posibles y en pequeña cantidad, estas variables corresponden a información demográfica del cliente por lo que para efectos del objetivo del proyecto es importante eliminar estas filas con estos valores.
  • Las variables que no son de interés son aquellas que están relacionadas con el mes, duración del contacto (llamada) y días transcurridos entre la llamada de una campaña previa y la actual pues las dos primeras no son útiles para plantear alguna hipótesis y para la última, mejor que saber cuantos días transcurrieron es saber el número de contactos previos (previous). Asimismo, la variable default dada la gran cantidad de valores unknown no es de interés para las hipótesis.

Tipos de datos

In [6]:
df.dtypes
Out[6]:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object
In [6]:
 

Duplicados

In [7]:
df[df.duplicated()==True]
Out[7]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
1266 39 blue-collar married basic.6y no no no telephone may thu 124 1 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
12261 36 retired married unknown no no no telephone jul thu 88 1 999 0 nonexistent 1.4 93.918 -42.7 4.966 5228.1 no
14234 27 technician single professional.course no no no cellular jul mon 331 2 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1 no
16956 47 technician divorced high.school no yes no cellular jul thu 43 3 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1 no
18465 32 technician single professional.course no yes no cellular jul thu 128 1 999 0 nonexistent 1.4 93.918 -42.7 4.968 5228.1 no
20216 55 services married high.school unknown no no cellular aug mon 33 1 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1 no
20534 41 technician married professional.course no yes no cellular aug tue 127 1 999 0 nonexistent 1.4 93.444 -36.1 4.966 5228.1 no
25217 39 admin. married university.degree no no no cellular nov tue 123 2 999 0 nonexistent -0.1 93.200 -42.0 4.153 5195.8 no
28477 24 services single high.school no yes no cellular apr tue 114 1 999 0 nonexistent -1.8 93.075 -47.1 1.423 5099.1 no
32516 35 admin. married university.degree no yes no cellular may fri 348 4 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
36951 45 admin. married university.degree no no no cellular jul thu 252 1 999 0 nonexistent -2.9 92.469 -33.6 1.072 5076.2 yes
38281 71 retired single university.degree no no no telephone oct tue 120 1 999 0 nonexistent -3.4 92.431 -26.9 0.742 5017.5 no

Completitud

In [8]:
def completitud(df):
  mis_val = df.isnull().sum()
  mis_val_percent = 100 * (1 - (df.isnull().sum()/len(df)))
  mis_val_percent.rename_axis('Atributo',inplace=True)
  return mis_val_percent
completitud(df)
Out[8]:
Atributo
age               100.0
job               100.0
marital           100.0
education         100.0
default           100.0
housing           100.0
loan              100.0
contact           100.0
month             100.0
day_of_week       100.0
duration          100.0
campaign          100.0
pdays             100.0
previous          100.0
poutcome          100.0
emp.var.rate      100.0
cons.price.idx    100.0
cons.conf.idx     100.0
euribor3m         100.0
nr.employed       100.0
y                 100.0
dtype: float64

Exactitud

Podemos ver que las variables job, marital, education, default, housing y loan contienen valores "unknown" por lo que es mejor eliminar las filas que contengan estos valores siempre y cuando tengan pocas filas que los incluyan.

In [9]:
for columna in df.columns:
  print(columna,"\n",df[columna].unique())
age 
 [56 57 37 40 45 59 41 24 25 29 35 54 46 50 39 30 55 49 34 52 58 32 38 44
 42 60 53 47 51 48 33 31 43 36 28 27 26 22 23 20 21 61 19 18 70 66 76 67
 73 88 95 77 68 75 63 80 62 65 72 82 64 71 69 78 85 79 83 81 74 17 87 91
 86 98 94 84 92 89]
job 
 ['housemaid' 'services' 'admin.' 'blue-collar' 'technician' 'retired'
 'management' 'unemployed' 'self-employed' 'unknown' 'entrepreneur'
 'student']
marital 
 ['married' 'single' 'divorced' 'unknown']
education 
 ['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course'
 'unknown' 'university.degree' 'illiterate']
default 
 ['no' 'unknown' 'yes']
housing 
 ['no' 'yes' 'unknown']
loan 
 ['no' 'yes' 'unknown']
contact 
 ['telephone' 'cellular']
month 
 ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep']
day_of_week 
 ['mon' 'tue' 'wed' 'thu' 'fri']
duration 
 [ 261  149  226 ... 1246 1556 1868]
campaign 
 [ 1  2  3  4  5  6  7  8  9 10 11 12 13 19 18 23 14 22 25 16 17 15 20 56
 39 35 42 28 26 27 32 21 24 29 31 30 41 37 40 33 34 43]
pdays 
 [999   6   4   3   5   1   0  10   7   8   9  11   2  12  13  14  15  16
  21  17  18  22  25  26  19  27  20]
previous 
 [0 1 2 3 4 5 6 7]
poutcome 
 ['nonexistent' 'failure' 'success']
emp.var.rate 
 [ 1.1  1.4 -0.1 -0.2 -1.8 -2.9 -3.4 -3.  -1.7 -1.1]
cons.price.idx 
 [93.994 94.465 93.918 93.444 93.798 93.2   92.756 92.843 93.075 92.893
 92.963 92.469 92.201 92.379 92.431 92.649 92.713 93.369 93.749 93.876
 94.055 94.215 94.027 94.199 94.601 94.767]
cons.conf.idx 
 [-36.4 -41.8 -42.7 -36.1 -40.4 -42.  -45.9 -50.  -47.1 -46.2 -40.8 -33.6
 -31.4 -29.8 -26.9 -30.1 -33.  -34.8 -34.6 -40.  -39.8 -40.3 -38.3 -37.5
 -49.5 -50.8]
euribor3m 
 [4.857 4.856 4.855 4.859 4.86  4.858 4.864 4.865 4.866 4.967 4.961 4.959
 4.958 4.96  4.962 4.955 4.947 4.956 4.966 4.963 4.957 4.968 4.97  4.965
 4.964 5.045 5.    4.936 4.921 4.918 4.912 4.827 4.794 4.76  4.733 4.7
 4.663 4.592 4.474 4.406 4.343 4.286 4.245 4.223 4.191 4.153 4.12  4.076
 4.021 3.901 3.879 3.853 3.816 3.743 3.669 3.563 3.488 3.428 3.329 3.282
 3.053 1.811 1.799 1.778 1.757 1.726 1.703 1.687 1.663 1.65  1.64  1.629
 1.614 1.602 1.584 1.574 1.56  1.556 1.548 1.538 1.531 1.52  1.51  1.498
 1.483 1.479 1.466 1.453 1.445 1.435 1.423 1.415 1.41  1.405 1.406 1.4
 1.392 1.384 1.372 1.365 1.354 1.344 1.334 1.327 1.313 1.299 1.291 1.281
 1.266 1.25  1.244 1.259 1.264 1.27  1.262 1.26  1.268 1.286 1.252 1.235
 1.224 1.215 1.206 1.099 1.085 1.072 1.059 1.048 1.044 1.029 1.018 1.007
 0.996 0.979 0.969 0.944 0.937 0.933 0.927 0.921 0.914 0.908 0.903 0.899
 0.884 0.883 0.881 0.879 0.873 0.869 0.861 0.859 0.854 0.851 0.849 0.843
 0.838 0.834 0.829 0.825 0.821 0.819 0.813 0.809 0.803 0.797 0.788 0.781
 0.778 0.773 0.771 0.77  0.768 0.766 0.762 0.755 0.749 0.743 0.741 0.739
 0.75  0.753 0.754 0.752 0.744 0.74  0.742 0.737 0.735 0.733 0.73  0.731
 0.728 0.724 0.722 0.72  0.719 0.716 0.715 0.714 0.718 0.721 0.717 0.712
 0.71  0.709 0.708 0.706 0.707 0.7   0.655 0.654 0.653 0.652 0.651 0.65
 0.649 0.646 0.644 0.643 0.639 0.637 0.635 0.636 0.634 0.638 0.64  0.642
 0.645 0.659 0.663 0.668 0.672 0.677 0.682 0.683 0.684 0.685 0.688 0.69
 0.692 0.695 0.697 0.699 0.701 0.702 0.704 0.711 0.713 0.723 0.727 0.729
 0.732 0.748 0.761 0.767 0.782 0.79  0.793 0.802 0.81  0.822 0.827 0.835
 0.84  0.846 0.87  0.876 0.885 0.889 0.893 0.896 0.898 0.9   0.904 0.905
 0.895 0.894 0.891 0.89  0.888 0.886 0.882 0.88  0.878 0.877 0.942 0.953
 0.956 0.959 0.965 0.972 0.977 0.982 0.985 0.987 0.993 1.    1.008 1.016
 1.025 1.032 1.037 1.043 1.045 1.047 1.05  1.049 1.046 1.041 1.04  1.039
 1.035 1.03  1.031 1.028]
nr.employed 
 [5191.  5228.1 5195.8 5176.3 5099.1 5076.2 5017.5 5023.5 5008.7 4991.6
 4963.6]
y 
 ['no' 'yes']

Veremos la cantidad de filas que tienen unknown en los valores de las columnas job, marital, education, housing, loan y default para determinar si se pueden eliminar o no algunas filas.

In [10]:
cols = ['job','marital','education','housing','loan','default']
for col in cols:
  print(pd.value_counts(df[col]))
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64
married     24928
single      11568
divorced     4612
unknown        80
Name: marital, dtype: int64
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: education, dtype: int64
yes        21576
no         18622
unknown      990
Name: housing, dtype: int64
no         33950
yes         6248
unknown      990
Name: loan, dtype: int64
no         32588
unknown     8597
yes            3
Name: default, dtype: int64
In [11]:
cols = ['job','marital','education','housing','loan']
for col in cols:
  df = df[df[col]!='unknown']
print(f"Quedan {df.shape[0]} filas y {df.shape[1]} columnas")
df.head()
Quedan 38245 filas y 21 columnas
Out[11]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

Variables de interés

In [12]:
df.drop(columns=['pdays','default','month','duration'], inplace=True)
df.head()
Out[12]:
age job marital education housing loan contact day_of_week campaign previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no telephone mon 1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school no no telephone mon 1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school yes no telephone mon 1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no telephone mon 1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no yes telephone mon 1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

Hipótesis

Las hipótesis a plantear están relacionadas con una caracterización de la población de manera que sea más fácil identificar la población objetivo para las campañas futuras basados en la aceptación de campañas previas.

  • Hipótesis 1: El nivel educativo está relacionado con la aceptación de estas campañas por parte de los clientes, entre más alto mayor aceptación puede haber.

  • Hipótesis 2: El estado civil está relacionado con la aceptación de estas campañas por parte de los clientes, entre más comprometido mayor aceptación puede haber.

  • Hipótesis 3: El tener préstamo de vivienda está relacionado negativamente con la aceptación de estas campañas por parte de los clientes.

  • Hipótesis 4: El tener préstamo a nivel personal está relacionado negativamente con la aceptación de estas campañas por parte de los clientes.

In [13]:
def freq_total(col,df):
  Tabla = pd.crosstab(index=df[col],columns=df["y"],margins=True)
  cols = list(Tabla.columns)
  cols[-1] = "Total_"+col
  Tabla.columns = cols
  idxs = list(Tabla.index)
  idxs[-1] = "Total_y"
  Tabla.index = idxs
  return Tabla/Tabla.loc['Total_y','Total_'+col]

def freq_relativCol(col,df):
  Tabla = pd.crosstab(index=df[col],columns=df["y"],margins=True)
  cols = list(Tabla.columns)
  cols[-1] = "Total_"+col
  Tabla.columns = cols
  idxs = list(Tabla.index)
  idxs[-1] = "Total_y"
  Tabla.index = idxs
  return Tabla.div(Tabla.loc['Total_y',:],axis=1)

def freq_relativFil(col,df):
  Tabla = pd.crosstab(index=df[col],columns=df["y"],margins=True)
  cols = list(Tabla.columns)
  cols[-1] = "Total_"+col
  Tabla.columns = cols
  idxs = list(Tabla.index)
  idxs[-1] = "Total_y"
  Tabla.index = idxs
  return Tabla.div(Tabla["Total_"+col], axis=0)

Hipótesis 1

Frecuencias relativas totales

In [14]:
freq_total('education',df)
Out[14]:
no yes Total_education
basic.4y 0.093868 0.010773 0.104641
basic.6y 0.052922 0.004706 0.057628
basic.9y 0.141169 0.011949 0.153118
high.school 0.215375 0.026330 0.241705
illiterate 0.000366 0.000105 0.000471
professional.course 0.118238 0.015113 0.133351
university.degree 0.266728 0.042358 0.309086
Total_y 0.888665 0.111335 1.000000

Frecuencias relativas a la columna

In [15]:
freq_relativCol('education',df)
Out[15]:
no yes Total_education
basic.4y 0.105629 0.096759 0.104641
basic.6y 0.059552 0.042273 0.057628
basic.9y 0.158855 0.107327 0.153118
high.school 0.242357 0.236496 0.241705
illiterate 0.000412 0.000939 0.000471
professional.course 0.133051 0.135744 0.133351
university.degree 0.300144 0.380460 0.309086
Total_y 1.000000 1.000000 1.000000

Frecuencias relativas a la fila

In [16]:
Tabla = freq_relativFil('education',df)
Tabla
Out[16]:
no yes Total_education
basic.4y 0.897051 0.102949 1.0
basic.6y 0.918330 0.081670 1.0
basic.9y 0.921960 0.078040 1.0
high.school 0.891064 0.108936 1.0
illiterate 0.777778 0.222222 1.0
professional.course 0.886667 0.113333 1.0
university.degree 0.862956 0.137044 1.0
Total_y 0.888665 0.111335 1.0
In [17]:
ax = (Tabla[["no", "yes"]]).plot(kind='bar',figsize=(15,4),width = 0.8,edgecolor=None)
plt.legend(labels=Tabla.columns,fontsize= 14)
plt.title("Porcentaje de aceptación por nivel educativo",fontsize= 16)

plt.xticks(fontsize=14, rotation=45)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.yticks([])

# Add this loop to add the annotations
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x, y + height + 0.01))

Parece que la población perteneciente a un nivel educativo superior o igual a high_school son los más propensos a aceptar respecto a los que no. Sin embargo, los iletrados son los miembros del nivel educativo que más aceptan.

Hipótesis 2

Frecuencias relativas totales

In [18]:
freq_total('marital',df)
Out[18]:
no yes Total_marital
divorced 0.100771 0.011714 0.112485
married 0.544960 0.061211 0.606171
single 0.242934 0.038410 0.281344
Total_y 0.888665 0.111335 1.000000

Frecuencias relativas a la columna

In [19]:
freq_relativCol('marital',df)
Out[19]:
no yes Total_marital
divorced 0.113396 0.105214 0.112485
married 0.613234 0.549789 0.606171
single 0.273369 0.344998 0.281344
Total_y 1.000000 1.000000 1.000000

Frecuencias relativas a la fila

In [20]:
Tabla = freq_relativFil('marital',df)
Tabla
Out[20]:
no yes Total_marital
divorced 0.895862 0.104138 1.0
married 0.899021 0.100979 1.0
single 0.863476 0.136524 1.0
Total_y 0.888665 0.111335 1.0
In [21]:
ax = (Tabla[["no", "yes"]]).plot(kind='bar',figsize=(15,4),width = 0.8,edgecolor=None)
plt.legend(labels=Tabla.columns,fontsize= 14)
plt.title("Porcentaje de aceptación por estado civil",fontsize= 16)

plt.xticks(fontsize=14, rotation=0)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.yticks([])

# Add this loop to add the annotations
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x, y + height + 0.01))

Parece que las personas casadas o divorciadas aceptan en igual medida pero menos con respecto a los solteros.

Hipótesis 3

Frecuencias relativas totales

In [22]:
freq_total('housing',df)
Out[22]:
no yes Total_housing
no 0.412080 0.049863 0.461943
yes 0.476585 0.061472 0.538057
Total_y 0.888665 0.111335 1.000000

Frecuencias relativas a la columna

In [23]:
freq_relativCol('housing',df)
Out[23]:
no yes Total_housing
no 0.463707 0.447863 0.461943
yes 0.536293 0.552137 0.538057
Total_y 1.000000 1.000000 1.000000

Frecuencias relativas a la fila

In [24]:
Tabla = freq_relativFil('housing',df)
Tabla
Out[24]:
no yes Total_housing
no 0.892059 0.107941 1.0
yes 0.885752 0.114248 1.0
Total_y 0.888665 0.111335 1.0
In [25]:
ax = (Tabla[["no", "yes"]]).plot(kind='bar',figsize=(15,4),width = 0.8,edgecolor=None)
plt.legend(labels=Tabla.columns,fontsize= 14)
plt.title("Porcentaje de aceptación por préstamo de vivienda",fontsize= 16)

plt.xticks(fontsize=14, rotation=0)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.yticks([])

# Add this loop to add the annotations
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x, y + height + 0.01))

Parece que el tener un préstamo de vivienda o no no afecta la aceptación

Hipótesis 4

Frecuencias relativas totales

In [26]:
freq_total('loan',df)
Out[26]:
no yes Total_loan
no 0.749562 0.094627 0.844189
yes 0.139103 0.016708 0.155811
Total_y 0.888665 0.111335 1.000000

Frecuencias relativas a la columna

In [27]:
freq_relativCol('loan',df)
Out[27]:
no yes Total_loan
no 0.84347 0.84993 0.844189
yes 0.15653 0.15007 0.155811
Total_y 1.00000 1.00000 1.000000

Frecuencias relativas a la fila

In [28]:
Tabla = freq_relativFil('loan',df)
Tabla
Out[28]:
no yes Total_loan
no 0.887908 0.112092 1.0
yes 0.892767 0.107233 1.0
Total_y 0.888665 0.111335 1.0
In [29]:
ax = (Tabla[["no", "yes"]]).plot(kind='bar',figsize=(15,4),width = 0.8,edgecolor=None)
plt.legend(labels=Tabla.columns,fontsize= 14)
plt.title("Porcentaje de aceptación por préstamo personal",fontsize= 16)

plt.xticks(fontsize=14, rotation=0)
for spine in plt.gca().spines.values():
    spine.set_visible(False)
plt.yticks([])

# Add this loop to add the annotations
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate('{:.0%}'.format(height), (x, y + height + 0.01))

Parece que el tener un préstamo personal no afecta la aceptación.

Validación hipótesis

Las hipótesis a validar seleccionadas son las dos primeras

  • Hipótesis 1: El nivel educativo está relacionado con la aceptación de estas campañas por parte de los clientes, entre más alto mayor aceptación puede haber.

  • Hipótesis 2: El estado civil está relacionado con la aceptación de estas campañas por parte de los clientes, entre más comprometido mayor aceptación puede haber.

In [30]:
def contingencia(col,df):
  Tabla = pd.crosstab(index=df[col],columns=df["y"])
  return Tabla

def chiquad(Tabla, prob = 0.95):
  estadistico, p_value, grados_libertad, freq_esperadas = chi2_contingency(Tabla)
  print(f"El estadístico de prueba es {estadistico}, el p_value es {p_value}, los grados de libertad {grados_libertad} y las frecuencias esperadas son:\n {freq_esperadas}")
  print("Interpretando p_value con probabilidad del {}%".format(prob*100))
  alpha = 1.0 - prob
  if p_value <= alpha:
    print('Dependientes (Se rechaza H0)')
  else:
    print('No son dependientes (No se puede rechazar H0)')
  print("Interpretando estadístico de prueba con probabilidad del {0}% y {1} grados de libertad".format(prob*100,grados_libertad))
  # interpret test-statistic
  critical = chi2.ppf(prob, grados_libertad)
  if abs(estadistico) >= critical:
    print('Dependientes (Se rechaza H0)')
  else:
    print('No son dependientes (No se puede rechazar H0)')

Hipótesis 1

H0: Las variables education y y son independientes

H1: Las variables education y y no son independientes

In [31]:
Tabla = contingencia("education",df)
chiquad(Tabla)
El estadístico de prueba es 170.01425619488032, el p_value es 4.4665371241673025e-34, los grados de libertad 6 y las frecuencias esperadas son:
 [[3.55643807e+03 4.45561930e+02]
 [1.95861807e+03 2.45381932e+02]
 [5.20402332e+03 6.51976677e+02]
 [8.21482097e+03 1.02917903e+03]
 [1.59959733e+01 2.00402667e+00]
 [4.53219244e+03 5.67807557e+02]
 [1.05049112e+04 1.31608885e+03]]
Interpretando p_value con probabilidad del 95.0%
Dependientes (Se rechaza H0)
Interpretando estadístico de prueba con probabilidad del 95.0% y 6 grados de libertad
Dependientes (Se rechaza H0)

Las variables education y y no son independientes

Hipótesis 2

H0: Las variables marital y y son independientes

H1: Las variables marital y y no son independientes

In [32]:
Tabla = contingencia("marital",df)
chiquad(Tabla)
El estadístico de prueba es 96.38458503893935, el p_value es 1.1758536958018105e-21, los grados de libertad 2 y las frecuencias esperadas son:
 [[ 3823.03762583   478.96237417]
 [20601.92498366  2581.07501634]
 [ 9562.03739051  1197.96260949]]
Interpretando p_value con probabilidad del 95.0%
Dependientes (Se rechaza H0)
Interpretando estadístico de prueba con probabilidad del 95.0% y 2 grados de libertad
Dependientes (Se rechaza H0)

Las variables marital y y no son independientes

Estrategia propuesta

La estrategia propuesta dada la relación de las variables education y marital con la variable y (aceptación de la suscripción al depósito) debe ser una campaña dirigida a personas iletradas o con nivel educativo superior o igual a high school y que sean solteras aunque también pueden ser personas casadas y divorciadas pues la proporción de personas que aceptaron previamente no varia mucho entre los estados civiles anteriormente mencionados.